SQ: To what extent do socioeconomic conditions predict variation in subway crime rates?
Introduction
This project examines the question, “Is there a correlation between crimes in the MTA and the stations where they occur?” Using NYPD subway crime complaint data alongside MTA ridership statistics, the broader goal is to understand not just where subway crime occurs, but whether observed patterns reflect differences in exposure, station usage, or neighborhood characteristics. Prior research has shown that subway crime declined during COVID and rebounded unevenly as ridership returned, raising questions about how safety has evolved in the post-pandemic period.
Within this broader framework, my subquestion focuses on the role of neighborhood socioeconomic conditions. Specifically, I examine whether subway crime rates differ across ZIP codes with varying median household incomes after adjusting for ridership. By linking crime incidents to ZIP code–level income data and normalizing by rider volume, this analysis explores whether socioeconomic context helps explain variation in subway crime beyond raw counts alone.
Data Acquisition
To prepare the NYPD complaint data for analysis, we restricted the dataset to incidents that occurred specifically within the subway system and applied several cleaning steps to improve data quality and spatial accuracy. First, we standardized variable names, converted complaint dates into usable formats, and extracted the year and month for later analysis. Because assigning complaints to ZIP codes requires precise geographic information, we removed all records missing valid latitude or longitude values. We then filtered out administrative and non-criminal incidents, retaining only penal-law complaint types relevant to the study of subway crime. Duplicate complaint numbers were removed to avoid double-counting, and records with coordinates falling outside the geographic bounds of New York City were excluded. The resulting dataset consists of unique, accurately geocoded subway crime complaints that can be reliably mapped to ZIP codes and merged with socioeconomic data, supporting the construction of accurate neighborhood-level crime rates.
To examine whether socioeconomic conditions help explain variation in subway crime, we rely on median household income as a neighborhood-level indicator. Median income is widely used in social science research as a summary measure of economic resources and living conditions, and it provides a consistent, comparable benchmark across New York City ZIP Code Tabulation Areas (ZCTAs, but for the purpose of this project we will call them zipcodes). To obtain these data programmatically, we use the U.S. Census Bureau’s American Community Survey (ACS) 2023 5-year estimates, accessed through the tidycensus interface, which provides reproducible API-based retrieval. After downloading the ACS table containing median household income, we retain only the essential fields—ZCTA identifiers and estimated income values, rename them using analysis friendly variable names, and save a cleaned dataset for later merging with crime data.
Code
# ACS income setuplibrary(tidycensus)library(dplyr)library(readr)acs_year <-2023# Download ACS 2023 median household income by ZCTA (ACS 5-year)acs_income_raw <-get_acs(geography ="zcta",variables ="B19013_001", # Median household incomeyear = acs_year,survey ="acs5",geometry =FALSE,cache_table =TRUE)glimpse(acs_income_raw)# Clean, rename, and remove ZIPs with missing incomeacs_income_clean <- acs_income_raw %>%transmute(Zipcode = GEOID,Income = estimate ) %>%filter(!is.na(Income)) write_csv( acs_income_clean,file.path("data/courseproject", "acs_2023_income_zcta.csv"))glimpse(acs_income_clean)
Code
library(dplyr)library(DT)library(htmltools)# Use the same global Calibri + title + striping CSS as the crime tablecustom_css <- tags$style(HTML(" table.dataTable, table.dataTable th, table.dataTable td, div.dataTables_info, div.dataTables_paginate, div.dataTables_filter, div.dataTables_length, .dataTables_wrapper { font-family: Calibri, sans-serif !important; } caption { background-color: #0039A6 !important; color: white !important; padding: 10px; font-size: 120% !important; font-weight: bold !important; caption-side: top !important; text-align: center !important; } table.dataTable tbody tr:nth-child(odd) { background-color: white !important; } table.dataTable tbody tr:nth-child(even) { background-color: #efefef !important; } table.dataTable tbody tr:hover { background-color: #e0e0e0 !important; }"))# Display-friendly datasetincome_display <- acs_income_clean %>%transmute(Zipcode = Zipcode,`Median Income`= Income )income_dt <-datatable( income_display,rownames =FALSE,caption ="Median Household Income by Zip Code (ACS 2023)",options =list(pageLength =10,lengthChange =FALSE,autoWidth =TRUE,scrollX =FALSE,dom ='tpif',columnDefs =list(list(width ='40%', targets =0), # Zipcodelist(width ='60%', targets =1) # Median Income ) ),class ='display compact') %>%formatCurrency("Median Income",currency ="$",interval =3,mark =",",digits =0 ) %>% htmlwidgets::prependContent(custom_css)income_dt
Joining Data Sets
To link crimes to neighborhood conditions, we first assign each crime incident to a ZIP Codeusing its latitude and longitude. This is done by performing a spatial join between the point locations of crimes and the polygon boundaries of Zipcodes obtained from the U.S. Census. Once each crime has a ZIP code, we merge the crime data with the ACS income data so that every incident carries the median income of its surrounding neighborhood. From this joint dataset, we then compute ZIP-level summaries and present them in an interactive, styled table that shows both median income and total subway crime by ZIP code.
Code
# Crime + Zip + Income joint dataset and detailed tablelibrary(tigris)library(sf)library(dplyr)library(DT)library(htmltools)options(tigris_use_cache =TRUE)# 1. Get ZCTA polygons and detect the ID columnzctas <-zctas(year =2020)zctas_sf <-st_transform(zctas, crs =4326)id_candidates <-c("GEOID", "GEOID10", "GEOID20", "ZCTA5CE10", "ZCTA5CE20")zip_col <-intersect(names(zctas_sf), id_candidates)[1]if (is.na(zip_col) ||length(zip_col) ==0) {stop("Could not find a suitable ZCTA ID column in zctas_sf.")}zctas_sf_small <- zctas_sf[, zip_col, drop =FALSE]# 2. Convert crime_clean to sf points using longitude and latitudecrime_sf <-st_as_sf( crime_clean,coords =c("longitude", "latitude"),crs =4326,remove =FALSE)# 3. Spatial join: assign ZCTA ID to each crime pointcrime_with_zcta <-st_join( crime_sf, zctas_sf_small,join = st_within,left =TRUE)# 4. Drop geometry, rename ZCTA ID column to Zipcodecrime_with_zip <- crime_with_zcta %>%st_drop_geometry()names(crime_with_zip)[names(crime_with_zip) == zip_col] <-"Zipcode"# 5. Merge crime data with ACS income by Zipcodecrime_with_income <- crime_with_zip %>%left_join(acs_income_clean, by ="Zipcode")# 6. Keep only rows with valid ZIP and Income and select key variablescrime_detail_full <- crime_with_income %>%filter(!is.na(Zipcode), !is.na(Income))wanted_cols <-c("cmplnt_num","date","borough","offense","Law_cat_cd","Zipcode","Income")crime_detail_full <- crime_detail_full %>%select(any_of(wanted_cols))crime_detail_table <- crime_detail_full# 7. Build interactive detailed crime-level datatablecrime_dt <-datatable( crime_detail_table,rownames =FALSE,colnames =gsub(pattern ="_",replacement =" ",x =names(crime_detail_table) ),options =list(pageLength =10,autoWidth =TRUE,scrollX =FALSE, #scrolling offdom ='<"top">t<"bottom"ipf>',initComplete =JS("function() {"," var api = this.api();"," var $table = $(api.table().container());"," var $filter = $table.find('div.dataTables_filter');"," $filter.appendTo($table.find('div.bottom'));","}" ) ),class ="cell-border stripe hover") %>% {if ("Income"%in%names(crime_detail_table)) {formatCurrency( .,"Income",currency ="$",interval =3,mark =",",digits =0 ) } else { . } }crime_detail_widget <-tagList( tags$style(HTML(" div.dataTables_wrapper { width: 800px !important; margin-left: auto; margin-right: auto; } table.dataTable { font-family: Calibri, Arial, sans-serif; } /* allow cell text to wrap instead of forcing horizontal scroll */ table.dataTable tbody td { white-space: normal !important; } table.dataTable.stripe tbody tr.even { background-color: #f2f2f2; } table.dataTable.stripe tbody tr.odd { background-color: #ffffff; } ")), tags$div(style ="width:800px; margin-left:auto; margin-right:auto;", tags$div(style =paste0("background-color:", mta_blue, ";","color:white;","font-family:Calibri, Arial, sans-serif;","font-size:20px;","text-align:center;","padding:8px;","border-radius:4px;","margin-bottom:8px;" ),"Subway Crime Records with Zip Code and Median Income" ), crime_dt ))crime_detail_widget
Subway Crime Records with Zip Code and Median Income
Ridership Data Download
For ridership, we use the MTA’s published average weekday ridership for each station in a given year. This measure reflects the typical number of riders using a station on a weekday and does not include weekend travel. While this means it does not capture total subway usage, average weekday ridership provides a stable and commonly used proxy for overall transit activity. It is less sensitive to short term fluctuations and reporting noise, which makes it appropriate for neighborhood level comparisons over time.
The ridership data was downloaded directly from the MTA’s published Excel file so the analysis could be done with a consistent, up-to-date source. After loading the file, the dataset was cleaned by keeping only real station rows, selecting the ridership years from 2018 to 2023, renaming the columns, and removing extra headers. Ridership numbers were rounded, and a new average ridership value was created along with a rank based on that average.
Because the ridership file did not include geographic coordinates, a separate MTA origin-destination dataset was used to obtain latitude and longitude for subway station complexes. Those coordinates were then used to assign each station to a ZIP Code (ZCTA) through spatial matching. The ZIP codes allow station ridership to be linked to neighborhood-level data, making it possible to study how subway usage varies across different areas and to connect ridership patterns to local socioeconomic characteristics later in the analysis.
Code
## Ridership Data Downloadlibrary(httr2)library(readxl)library(dplyr)library(readr)library(DT)library(htmltools)dir.create("data/courseproject", recursive =TRUE, showWarnings =FALSE)station_ridership_url <-"https://www.mta.info/document/137106"station_ridership_path <-file.path("data", "courseproject", "station_ridership.xlsx")# Download the Excel file once, then reuse itif (!file.exists(station_ridership_path)) {request(station_ridership_url) |>req_perform(path = station_ridership_path)}# Read the FIRST sheet (sheet name in the file is a bit messy)station_ridership_raw <-read_excel(station_ridership_path)station_ridership_clean <- station_ridership_raw %>%rename(Station =`Average Weekday Subway Ridership`,Borough = ...3,rid_2018 = ...4,rid_2019 = ...5,rid_2020 = ...6,rid_2021 = ...7,rid_2022 = ...8,rid_2023 = ...9,Rank_2023 = ...12 ) %>%filter(!is.na(Station), Station !="Station (alphabetical by borough)" ) %>%filter(!is.na(Borough), Borough %in%c("Bx", "B", "M", "Q", "SI") ) %>%select( Station, Borough, rid_2018, rid_2019, rid_2020, rid_2021, rid_2022, rid_2023 ) %>%# round ridership to whole numbersmutate(across(starts_with("rid_"), ~round(., 0)) ) %>%# average across 2018–2023 and rank by that averagemutate(Avg_2018_2023 =round(rowMeans(across(starts_with("rid_")), na.rm =TRUE), 0),Avg_Rank =min_rank(dplyr::desc(Avg_2018_2023)) )# 2023-only version for normalization laterstation_ridership_2023 <- station_ridership_clean %>%select( Station, Borough,Ridership_2023 = rid_2023 )# Save cleaned datasetswrite_csv( station_ridership_clean,file.path("data/courseproject", "station_ridership_annual_2018_2023_clean.csv"))write_csv( station_ridership_2023,file.path("data/courseproject", "station_ridership_2023_clean.csv"))# Interactive tableif (!exists("mta_blue")) { mta_blue <-"#0039A6"}ridership_dt <-datatable( station_ridership_clean,rownames =FALSE,colnames =c("Station","Borough","2018","2019","2020","2021","2022","2023","2018–2023 Avg.","Avg. Rank" ),options =list(pageLength =10,autoWidth =FALSE,scrollX =FALSE,dom ='<"top">t<"bottom"ipf>',columnDefs =list(list(width ="220px", targets =0), list(width ="40px", targets =1), list(width ="75px", targets =2:8), list(width ="70px", targets =9), list(className ="dt-right", targets =2:9) ),initComplete =JS("function() {"," var api = this.api();"," var $table = $(api.table().container());"," var $filter = $table.find('div.dataTables_filter');"," $filter.appendTo($table.find('div.bottom'));","}" ) ),class ="cell-border stripe hover compact")ridership_widget <-tagList( tags$style(HTML(" div.dataTables_wrapper { max-width: 950px; width: 100%; margin-left: auto; margin-right: auto; } table.dataTable th, table.dataTable td { padding: 4px 6px; } table.dataTable { font-family: Calibri, Arial, sans-serif; } table.dataTable.stripe tbody tr.even { background-color: #f2f2f2; } table.dataTable.stripe tbody tr.odd { background-color: #ffffff; } ")), tags$div(style ="max-width:950px; width:100%; margin-left:auto; margin-right:auto;", tags$div(style =paste0("background-color:", mta_blue, ";","color:white;","font-family:Calibri, Arial, sans-serif;","font-size:20px;","text-align:center;","padding:8px;","border-radius:4px;","margin-bottom:8px;" ),"Annual Station Ridership, 2018–2023" ), ridership_dt ))ridership_widget
Code
## Station complex lookup (ID, name, lat, lon) from API DOWNLOAD ORIGIN DATA FOR COMPLEX LAT LONGlibrary(httr2)library(readr)library(dplyr)dir.create("data/courseproject", recursive =TRUE, showWarnings =FALSE)stations_path <-file.path("data/courseproject", "od_station_complexes_2024.csv")base_url <-"https://data.ny.gov/resource/jsu2-fbtj.csv"app_token <-Sys.getenv("SOCRATA_APP_TOKEN") # optionalreq <-request(base_url) |>req_url_query(`$select`=paste("origin_station_complex_id","origin_station_complex_name","origin_latitude","origin_longitude",sep ="," ),`$where`="origin_station_complex_id is not null",`$limit`=50000 ) |>req_user_agent("STA9750-courseproject") |>req_retry(max_tries =5)if (nzchar(app_token)) { req <- req |>req_headers(`X-App-Token`= app_token)}tmp <-tempfile(fileext =".csv")req |>req_perform(path = tmp)od_stations <-read_csv(tmp, show_col_types =FALSE) |>transmute(complex_id = origin_station_complex_id,Station = origin_station_complex_name,lat =as.numeric(origin_latitude),lon =as.numeric(origin_longitude) ) |>filter(!is.na(complex_id), !is.na(lat), !is.na(lon)) |>distinct(complex_id, .keep_all =TRUE) |>arrange(Station)write_csv(od_stations, stations_path)glimpse(od_stations)
Code
# zipcode assign to stationlibrary(sf)library(dplyr)library(readr)library(tigris)options(tigris_use_cache =TRUE)lookup_path <-file.path("data/courseproject", "station_complex_zip_lookup.csv")od_path <-file.path("data/courseproject", "od_station_complexes_2024.csv")# read OD stationsod_stations <-read_csv(od_path, show_col_types =FALSE)stations_sf <- od_stations %>%st_as_sf(coords =c("lon", "lat"), crs =4326, remove =FALSE)# Use ALL NY ZCTAs (no bbox filter). This is still fine for 424 points.ny_zctas <- tigris::zctas(state ="NY", year =2010) %>%st_transform(4326) %>%select(ZCTA = ZCTA5CE10)# 1) primary join (intersects)stations_joined <-st_join(stations_sf, ny_zctas, join = st_intersects, left =TRUE)# 2) fallback: if ZCTA is NA, assign nearest ZCTA polygonmissing_idx <-which(is.na(stations_joined$ZCTA))if (length(missing_idx) >0) { nearest <-st_nearest_feature(stations_joined[missing_idx, ], ny_zctas) stations_joined$ZCTA[missing_idx] <- ny_zctas$ZCTA[nearest]}station_zip_lookup <- stations_joined %>%st_drop_geometry() %>%transmute( complex_id, Station,Zipcode = ZCTA ) %>%distinct(complex_id, .keep_all =TRUE) %>%arrange(Station)write_csv(station_zip_lookup, lookup_path)station_zip_lookup %>%summarise(total =n(), missing_zip =sum(is.na(Zipcode)))
To analyze subway crime patterns in a consistent way across neighborhoods and across time, we aggregated the data at the ZIP Code and year level. Each subway crime incident was assigned to a ZIP Code using its latitude and longitude, and crimes were then counted within each ZIP Code for each year from 2018 through 2023. This allowed us to compare crime levels across neighborhoods.
Station level average weekday ridership values were summed within each year across all stations located in the same ZIP Code to create a ridership by Zip Code measure. This ensures that crime counts for a given year are normalized using ridership from the same year, rather than mixing ridership levels across different time periods.
We then normalized crime counts by dividing the number of crime complaints by total ZIP Code ridership and scaling the result to crimes per 100,000 average weekday rides. This normalization step is important because ZIP Codes serve very different numbers of riders, and overall subway usage changed substantially over the study period, especially during the COVID era.
Code
library(dplyr)library(sf)library(tigris)library(tidyr)options(tigris_use_cache =TRUE)# Step 1: Standardize ZIP codes as 5-digit character strings for consistent joinsfix_zip <-function(x) {sprintf("%05s", as.character(x))}# Step 2: Load NY ZCTA polygons and prepare a 5-digit Zipcode fieldny_zctas <- tigris::zctas(state ="NY", year =2010) %>%st_transform(4326) %>%select(Zipcode = ZCTA5CE10) %>%mutate(Zipcode =fix_zip(Zipcode))# Step 3: Convert crime records into spatial points using longitude and latitudecrime_sf <- crime_clean %>%filter(!is.na(latitude), !is.na(longitude)) %>%st_as_sf(coords =c("longitude", "latitude"), crs =4326, remove =FALSE)# Step 4: Spatially join crimes to ZCTAs to assign each crime a Zipcodecrime_joined <-st_join(crime_sf, ny_zctas, join = st_intersects, left =TRUE)# Step 5: If any crimes did not fall inside a polygon, assign the nearest ZCTA as a fallbackmiss_idx <-which(is.na(crime_joined$Zipcode))if (length(miss_idx) >0) { nearest <-st_nearest_feature(crime_joined[miss_idx, ], ny_zctas) crime_joined$Zipcode[miss_idx] <- ny_zctas$Zipcode[nearest]}# Step 6: Drop geometry and keep a clean crime dataset with Zipcodecrime_with_zip <- crime_joined %>%st_drop_geometry() %>%mutate(Zipcode =fix_zip(Zipcode))# Step 7: Standardize ridership ZIP codes so they match the crime ZIP formatstation_ridership_with_zip <- station_ridership_with_zip %>%mutate(Zipcode =fix_zip(Zipcode))# Step 8: Convert station-level annual ridership columns into a Zipcode-year tableridership_zip_year <- station_ridership_with_zip %>%select(Zipcode, rid_2018, rid_2019, rid_2020, rid_2021, rid_2022, rid_2023) %>%pivot_longer(cols =starts_with("rid_"),names_to ="year",values_to ="ridership" ) %>%mutate(year =as.integer(sub("rid_", "", year)) ) %>%group_by(Zipcode, year) %>%summarise(zip_ridership =sum(ridership, na.rm =TRUE),.groups ="drop" )# Step 9: Count crimes by Zipcode-year and compute crimes per 100,000 ridescrime_zip_year <- crime_with_zip %>%filter(year %in%2018:2023) %>%group_by(Zipcode, year) %>%summarise(crime_count =n(),.groups ="drop" ) %>%left_join(ridership_zip_year, by =c("Zipcode", "year")) %>%mutate(crime_per_100k_rides =1e5* crime_count / zip_ridership )# Step 10: Attach the Zipcode-year normalized rate back onto each crime recordcrime_with_rate <- crime_with_zip %>%left_join( crime_zip_year %>%select(Zipcode, year, crime_per_100k_rides),by =c("Zipcode", "year") )
Some ZIP code–year combinations appear in the table with missing values for both ridership and the normalized crime rate. These rows correspond to ZIP codes where no subway stations are located, and therefore no station-level ridership is recorded for that year. Because crimes per 100,000 rides cannot be computed without a valid ridership denominator, the normalized crime rate is undefined for these ZIP codes. We retain these rows in the table for transparency, but they are excluded from all analyses that require normalized crime rates, including correlation tests and mapping.
Code
library(DT)library(htmltools)library(dplyr)custom_css <- tags$style(HTML(" table.dataTable, table.dataTable th, table.dataTable td, div.dataTables_info, div.dataTables_paginate, div.dataTables_filter, div.dataTables_length, .dataTables_wrapper { font-family: Calibri, sans-serif !important; } caption { background-color: #0039A6 !important; color: white !important; padding: 10px; font-size: 120% !important; font-weight: bold !important; caption-side: top !important; text-align: center !important; } table.dataTable tbody tr:hover { background-color: #e0e0e0 !important; }"))# Prepare data for displaycrime_zip_display <- crime_zip_year %>%mutate(crime_per_100k_rides =round(crime_per_100k_rides, 2),zip_ridership =round(zip_ridership, 0) ) %>%transmute(Zipcode = Zipcode,Year = year,`Crime Complaints`= crime_count,`Total Daily Ridership`= zip_ridership,`Crimes per 100,000 rides`= crime_per_100k_rides ) %>%arrange(Year, desc(`Crimes per 100,000 rides`))datatable( crime_zip_display,rownames =FALSE,caption ="Crime per Ridership by ZIP Code, 2018–2023",options =list(pageLength =10,lengthChange =FALSE,autoWidth =TRUE,scrollX =FALSE,dom ="tpif",columnDefs =list(list(width ="12%", targets =0),list(width ="10%", targets =1),list(width ="18%", targets =2),list(width ="18%", targets =3),list(width ="20%", targets =4),list(className ="dt-right", targets =2:4) ),# Color rows by year using a light orange gradientrowCallback =JS("function(row, data) {"," var year = parseInt(data[1]);"," var colors = {"," 2018: '#fff3e6',"," 2019: '#ffe6cc',"," 2020: '#ffd9b3',"," 2021: '#ffcc99',"," 2022: '#ffbf80',"," 2023: '#ffb366'"," };"," if (colors[year]) {"," $(row).css('background-color', colors[year]);"," }","}" ) ),class ="display compact") %>% htmlwidgets::prependContent(custom_css)
This code creates a ZIP Code level dataset that links subway crime exposure to neighborhood income, and then visualizes the relationship. First, crimes are grouped by ZIP Code and year, and the total number of complaints is counted for each ZIP year combination. That crime count is then joined to the matching ZIP year ridership totals, and a normalized rate is computed as crimes per 100,000 rides by dividing crimes by ridership and scaling the result.
Next, the code collapses the data across time by taking the average of the yearly normalized crime rates for each ZIP Code from 2018 to 2023. This produces one summary crime rate per ZIP Code that reflects typical crime exposure over the full study period, rather than focusing on a single year that might be unusual. Median household income from the ACS is then joined to this ZIP level summary so that each ZIP Code has both an income value and an average normalized crime rate, and ZIP Codes with missing values are removed.
After building this final ZIP level dataset, the code runs a Spearman correlation test between income and the average crimes per 100,000 rides. Spearman is used because it tests whether there is a consistent increasing or decreasing relationship between the two variables without requiring the relationship to be perfectly linear or the data to be normally distributed. Finally, the scatterplot visualizes this same comparison by plotting median household income on the x axis and the average crimes per 100,000 rides on the y axis. Each point represents one ZIP Code, so the graph shows whether higher or lower income ZIP Codes tend to have higher or lower crime rates after adjusting for ridership. The fitted trend line is added to summarize the overall direction of the relationship across all ZIP Codes.
Map of Median Income by Zipcode and Crime Rate
Code
library(dplyr)library(sf)library(tigris)library(leaflet)library(htmltools)library(scales)options(tigris_use_cache =TRUE)# This helper makes sure ZIP codes are always 5 digit textfix_zip <-function(x) sprintf("%05s", as.character(x))# This function builds the interactive ZIP mapcreate_interactive_zip_map <-function(zip_summary) {# Step 1: Clean ZIP codes so joins work zip_summary_clean <- zip_summary %>%mutate(Zipcode =fix_zip(Zipcode),Income =as.numeric(Income),avg_crime_per_100k_rides =as.numeric(avg_crime_per_100k_rides),total_crimes =as.numeric(total_crimes) ) %>%filter(!is.na(Zipcode),!is.na(Income),!is.na(avg_crime_per_100k_rides),!is.na(total_crimes) )# Step 2: Download NY ZCTA polygons and keep only ZIPs in our dataset ny_zctas <- tigris::zctas(state ="NY", year =2010) %>%st_transform(4326) %>%transmute(Zipcode =fix_zip(ZCTA5CE10)) zctas_joined <- ny_zctas %>%left_join(zip_summary_clean, by ="Zipcode") %>%filter(!is.na(Income), !is.na(total_crimes), !is.na(avg_crime_per_100k_rides))# Step 3: Create points for bubble placement zctas_points <-st_point_on_surface(zctas_joined)# Step 4: Build the income color palette pal_income <-colorNumeric(palette =colorRampPalette(c("red", "yellow", "green"))(256),domain = zctas_joined$Income,na.color ="transparent" )# Step 5: Scale bubble radius directly from total crimes crime_min <-min(zctas_points$total_crimes, na.rm =TRUE) crime_max <-max(zctas_points$total_crimes, na.rm =TRUE) scale_radius <-function(x, min_r =3, max_r =20) {if (is.na(x)) return(min_r)if (crime_min == crime_max) return((min_r + max_r) /2) min_r + (x - crime_min) * (max_r - min_r) / (crime_max - crime_min) } zctas_points <- zctas_points %>%mutate(radius =vapply(total_crimes, scale_radius, numeric(1)))# Step 6: Create popup text zctas_joined <- zctas_joined %>%mutate(popup_text =paste0("<b>ZIP Code:</b> ", Zipcode, "<br>","<b>Median household income (ACS 2023):</b> ", dollar(Income, accuracy =1), "<br>","<b>Total crimes (2018–2023):</b> ", comma(total_crimes), "<br>","<b>Avg. crimes per 100,000 rides (2018–2023):</b> ", round(avg_crime_per_100k_rides, 2), "<br>","<b>Total ridership (sum of annual avg weekday ridership, 2018–2023):</b> ", comma(round(total_ridership, 0)) ) ) zctas_points <- zctas_points %>%left_join( zctas_joined %>%st_drop_geometry() %>%select(Zipcode, popup_text),by ="Zipcode" )# Step 7: Add a simple note explaining bubble scaling bubble_note <-paste0("<div style='background: white; padding: 10px; border-radius: 6px; border: 1px solid #ccc;'>","<div style='font-weight: bold; margin-bottom: 6px;'>Bubble size</div>","<div style='font-size: 12px;'>","Bubble radius is scaled linearly to total crimes (2018–2023).<br>","<b>Min:</b> ", comma(round(crime_min, 0)), " crimes<br>","<b>Max:</b> ", comma(round(crime_max, 0)), " crimes","</div>","</div>" )# Step 8: Make the mapleaflet() %>%addProviderTiles(providers$CartoDB.Positron) %>%# Step 9: Add ZIP polygons colored by incomeaddPolygons(data = zctas_joined,fillColor =~pal_income(Income),fillOpacity =0.55,color ="#444444",weight =1,opacity =0.8,popup =~popup_text,group ="ZIP income" ) %>%# Step 10: Add bubbles sized by total crimes and labeled with crime countsaddCircleMarkers(data = zctas_points,radius =~radius,fillColor ="black",fillOpacity =0.25,color ="black",weight =1,popup =~popup_text,label =~paste0("Crimes: ", comma(total_crimes)),labelOptions =labelOptions(direction ="top",textsize ="12px",opacity =0.9 ),group ="Crime count bubbles" ) %>%# Step 11: Add an income legendaddLegend(position ="bottomright",pal = pal_income,values = zctas_joined$Income,title ="Median income (ACS 2023)",labFormat =labelFormat(prefix ="$", big.mark =",", digits =0),opacity =0.95 ) %>%# Step 12: Add the bubble scaling noteaddControl(html =HTML(bubble_note),position ="bottomleft" ) %>%# Step 13: Add layer controlsaddLayersControl(overlayGroups =c("ZIP income", "Crime count bubbles"),options =layersControlOptions(collapsed =FALSE) )}interactive_zip_map <-create_interactive_zip_map(zip_summary)interactive_zip_map
Spearman Rank Correlation Analysis
To formally assess the relationship between neighborhood socioeconomic conditions and subway crime rates, we apply Spearman’s rank correlation test. Spearman’s correlation is a nonparametric measure of association, meaning it does not require the variables to follow a normal distribution or assume a linear relationship. Instead, it evaluates whether two variables move together in a consistent increasing or decreasing pattern. This makes the test well suited for this analysis, as both median household income and normalized subway crime rates vary widely across ZIP codes and may be skewed by extreme values. Each observation represents a ZIP Code Tabulation Area. For each ZIP code, median household income from the 2023 American Community Survey is paired with the average subway crime rate per 100,000 average weekday rides, calculated across the years 2018 through 2023. Normalizing crime by ridership ensures that neighborhoods with higher subway usage are not mechanically associated with higher crime counts. The Spearman test works by ranking ZIP codes from lowest to highest income and separately ranking them from lowest to highest normalized crime rates. The test then evaluates whether higher income ranks tend to be associated with higher or lower crime rate ranks.
The hypotheses for the test are defined as follows:
H₀: There is no monotonic relationship between median household income and normalized subway crime rates across ZIP codes (ρ = 0).
Hₐ: There is a monotonic relationship between median household income and normalized subway crime rates across ZIP codes (ρ ≠ 0).
The results of the Spearman correlation test yield a correlation coefficient of ρ = −0.3819, with a test statistic S = 368,854 and a p value of 2.482 × 10⁻⁵. The negative value of the correlation indicates that ZIP codes with higher median household income tend to have lower subway crime rates per 100,000 rides. Because the p value is far below standard significance thresholds, we reject the null hypothesis. Overall, these findings provide strong evidence of a statistically significant negative association between neighborhood income levels and normalized subway crime rates. While this analysis does not imply causation, it suggests that socioeconomic conditions are closely related to variation in subway crime across New York City neighborhoods.
Because normalized crime rates can become extreme in ZIP codes with very low ridership, we conduct a robustness check using a rule-of-thumb outlier procedure. Specifically, we identify outliers in the average crimes per 100,000 rides variable using the interquartile range (IQR) criterion and temporarily exclude ZIP codes falling outside 1.5×IQR from the Spearman correlation test. This sensitivity analysis is not used as the primary result, but rather to verify that the observed relationship is not driven by a small number of extreme rate values. The direction and statistical significance of the correlation remain substantively unchanged after trimming, indicating that the main findings are robust to reasonable outlier handling choices.
# plotggplot(zip_summary_trimmed, aes(x = Income, y = avg_crime_per_100k_rides)) +geom_point(alpha =0.7) +geom_smooth(method ="lm", se =FALSE) +scale_x_continuous(labels =dollar_format(accuracy =1)) +labs(title ="Median Income vs Subway Crime Rate by ZIP Code (IQR-Trimmed)",x ="Median household income (ACS 2023)",y ="Average crimes per 100,000 rides (2018–2023)" )
Conclusion
A key limitation of this analysis is the imperfect measurement of exposure. Subway ridership data are not available at the ZIP code level, so station ridership had to be spatially assigned to ZIP codes based on station locations. This approach does not fully capture how riders move across neighborhoods or where trips actually begin and end, which may introduce measurement error in the normalized crime rates.
Future work could improve this analysis by incorporating additional socioeconomic indicators beyond median income, such as unemployment, housing instability, or educational attainment. Finally, organizing and analyzing crime by offense type could reveal whether certain crimes are more strongly associated with specific neighborhood characteristics, providing a more nuanced understanding of how socioeconomic conditions relate to subway crime risk for policy initiatives.